The mosaic package provides facilities for putting model fitting into a “mathematical function” framework.
The workhorses lm() and glm() do the fitting.
mosaic::fitModel() provides a named-parameter interface to fitting, both linear and nonlinear.
xyplot() and mosaic::plotFun() allow students to display the model functions over the data.
hmod <- lm( height ~ father, data=subset(Galton,sex=="M"))
Extract the function … just like resid() or fitted().
hfun <- makeFun(hmod)
Plot it … first data, then the function.
xyplot(height ~ father | sex, data=Galton)
plotFun( hfun(x) ~ x, add=TRUE)
mod2 <- lm( height ~ father + sex, data=Galton)
hfun2 <- makeFun(mod2)
xyplot(height ~ father | sex, data=Galton)
plotFun( hfun2(x, sex="F") ~ x, add=TRUE)
plotFun( hfun2(x, sex="M") ~ x, add=TRUE, col='red')
Or …
mod3 <- lm( height ~ father*mother + sex, data=Galton)
hfun3 <- makeFun(mod3)
plotFun( hfun3(father=x, mother=y, sex="F") ~ x&y,
x.lim=c(60,80),y.lim=c(57,73))
plotPoints( mother ~ father, data=Galton, add=TRUE,
pch=20,col='red')
hfun2( father=66, sex=c('M','F'), interval='confidence')
## fit lwr upr
## 1 67.87340 67.59130 68.15550
## 2 62.69736 62.40424 62.99049
hfun2( father=66, sex=c('M','F'), interval='prediction')
## fit lwr upr
## 1 67.87340 63.39609 72.35072
## 2 62.69736 58.21934 67.17539
(The confidence interval describes uncertainty about the mean height in each group; the prediction interval describes the spread of individual heights, so it is much wider.)
ggplot(data=Galton, aes(x=father, y=height)) + geom_point() +
aes(colour=sex) + stat_smooth(method=lm) +
theme(legend.position="none") + labs(title="")
Convenient interface via mosaic::mPlot().
One of the objectives of Project MOSAIC is to make connections between statistics and mathematics and computation. The mosaic package offers some features specifically oriented to doing math.
These are suggestions or demonstrations appropriate in teaching statistical modeling to students not afraid of math, e.g. engineers, statistics minors, …
f <- makeFun( 3*x + 2 ~ x)
Read this as, “make a function f that takes \(x\) as an input and produces \(3x + 2\) as the output.”
Evaluating the function is a matter of providing an input:
f(4)
## [1] 14
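Under the hood, makeFun() simply builds an ordinary R function. A base-R sketch of what it constructs here:

```r
# What makeFun(3*x + 2 ~ x) builds, written out by hand:
# an ordinary R function of one argument.
f <- function(x) 3 * x + 2
f(4)
# [1] 14
```

The makeFun() notation just lets students write the same thing in the familiar “formula with an output and an input” style.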
It’s easy to plot functions:
plotFun( 3*x + 2 ~ x, x.lim=range(-5,5) )
Or, if we already have the function f, give \(x\) as an input:
plotFun( f(x) ~ x, x.lim=range(-5,5) )
Functions of multiple variables follow the same scheme:
g <- makeFun( 3*x*y + 2*x -4*y +2 ~ x&y)
g(x=0,y=2)
## [1] -6
plotFun( g(x=x,y=y)~x&y,
x.lim=range(-2,2),y.lim=range(-2,2) )
Or even cross-sections or parametrically …
plotFun( g(x=x, y=0.5) ~ x, x.lim=range(-2,2) )
plotFun( g(x=3*cos(t),y=sin(t))~ t, t.lim=range(-5,5))
Why is this relevant to statistical modeling?
The workhorses of model building in R are lm() and glm(). They are great, but unfortunately they return values that are not what students are used to.
In particular, students are used to lines expressed as \(y = m x + b\), etc.
lm() gives back something different, e.g.
lm( height ~ father, data=Galton )
## (Intercept) father
## 39.1103868 0.3993813
Quite different from \(y = m x + b\)!
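To get back to the familiar form, a student has to rebuild the function from the coefficient vector by hand. A sketch with made-up stand-in data (in the text, Galton’s father and height columns play this role):

```r
# made-up stand-in data, standing in for Galton's father/height columns
father <- c(62, 64, 66, 68, 70, 72, 74)
height <- c(64, 66, 66, 68, 69, 71, 71)
m <- lm(height ~ father)
# pull b and m out of coef(m) and reassemble y = m*x + b by hand
b     <- coef(m)[["(Intercept)"]]
slope <- coef(m)[["father"]]
f <- function(x) slope * x + b
```

A least-squares line with an intercept always passes through the point of averages, so f(mean(father)) reproduces mean(height) exactly. makeFun() does this repackaging automatically.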
And then, what happens when you plot a model …
m = lm( height ~ father, data=Galton )
plot(m)
The lm() function was written by statisticians for statisticians. Confidence intervals and p-values are easy:
coef(summary(m))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.1103868 3.22706259 12.119501 2.055761e-31
## father 0.3993813 0.04658212 8.573704 4.354588e-17
lm() is particularly valuable when there are multiple explanatory variables.
m2 <- lm( height ~ sex + father + mother, data=Galton)
coef(summary(m2))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.3447600 2.74696121 5.586085 3.082284e-08
## sexM 5.2259513 0.14400791 36.289335 5.786616e-178
## father 0.4059780 0.02920696 13.900044 6.525648e-40
## mother 0.3214951 0.03128178 10.277392 1.701771e-23
One strategy is to start with what students already know, and build on that.
Start with the familiar form: height(father) = m * father + b.
Variable names are set by the data: Galton in this case.
names(Galton)
## [1] "family" "father" "mother" "sex"    "height" "nkids"
Turn the function into a formula, height(father) ~ m * father + b, and do the fitting:
f <- fitModel( height ~ m*father + b, data=Galton)
You plot this the same way as “mathematical” functions:
plotFun( f(father=x)~x, x.lim=range(60,80) )
If you want to talk about residuals, least squares, etc, you can do so. For instance, have students guess the function before fitting it …
xyplot( height ~ father, data=Galton)
myGuess <- makeFun(40 + .35*x ~ x )
plotFun( myGuess(x) ~ x, add=TRUE, col='red')
plotFun( f(father=x) ~ x, add=TRUE, col='green')
Which is better?
mean( (myGuess(father)-height)^2, data=Galton )
## [1] 18.26244
mean( (f(father) - height)^2, data=Galton)
## [1] 11.85077
Students can explore these questions:
f0 <- fitModel(height ~ a + 0*father, data=Galton)
f1 <- fitModel(height ~ a + b*father, data=Galton)
f2 <- fitModel(height ~ a + b*father + c*father^2, data=Galton)
xyplot( height ~ father, data=Galton)
plotFun( f0(father=x)~x, add=TRUE)
plotFun( f1(father=x)~x, add=TRUE, col='red')
plotFun( f2(father=x)~x, add=TRUE, col='green')
Which function is better? (Hint: it doesn’t depend on the data!)
mean( ~ I((f0(father)-height)^2), data=Galton )
## [1] 12.82301
mean( ~ I((f1(father)-height)^2), data=Galton )
## [1] 11.85077
mean( ~ I((f2(father)-height)^2), data=Galton )
## [1] 11.82975
It just isn’t packaged as such. We can do so with mosaic::makeFun()
m1 <- lm( height ~ father, data=Galton)
g1 <- makeFun(m1)
g1( father=65 )
## 1
## 65.07017
While we’re at it …
Let’s do what Galton never could: Include women!
And we can take a liberal view about how the child came to be: the father and the mother multiplying:
mod <- lm( height ~ sex + father*mother, data=Galton )
f <- makeFun( mod )
f( sex='F', father=65, mother=65 )
## 1
## 62.59029
plotFun( f(sex='M', father=x, mother=y) ~ x & y,
x.lim=range(60,80), y.lim=range(60,80))
Smoothers are functions, too:
g <- smoother( height ~ father, data=Galton, degree=1 )
xyplot(height ~ father, data=Galton)
plotFun( g(father=x)~x, add=TRUE, lwd=3 )
plotFun( f1(father=x)~x, add=TRUE, col='red',
lwd=3, alpha=.5)
Where the data are, the two functions are basically the same.
Example: Mortality at 20 years of follow-up for smokers and non-smokers.
sample( Whickham, size=5 )
## outcome smoker age orig.id
## 665 Alive No 22 665
## 781 Alive No 32 781
## 1062 Dead No 67 1062
## 956 Dead No 72 956
## 200 Alive No 51 200
counts <- tally( ~ outcome & smoker, data=Whickham )
chisq.test( counts )
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: counts
## X-squared = 8.7515, df = 1, p-value = 0.003093
And what’s the effect size? In what direction?
tally( outcome ~ smoker, data=Whickham )
## smoker
## outcome No Yes
## Alive 502 443
## Dead 230 139
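The proportions behind those counts answer both questions at once (counts copied from the tally above):

```r
# counts copied from the tally above
alive <- c(No = 502, Yes = 443)
dead  <- c(No = 230, Yes = 139)
round(alive / (alive + dead), 3)   # proportion still alive in each group
#    No   Yes
# 0.686 0.761
```

About 69% of non-smokers but 76% of smokers were still alive, which sets up the surprise below; age, it turns out, is the lurking variable, which is why the next model includes it.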
Really? Smokers had a larger chance of surviving?
Just as there are smoothers, there are function types suited to Dead/Alive outcomes … e.g. logistic regression.
smod <- glm( outcome=='Alive' ~ age*smoker, family='binomial',
data=Whickham)
f <- makeFun(smod)
## Warning in inferTransformation(formula(object)): You may need to specify
## transformation to get the desired results.
For a 60-year old …
f( age=60, smoker=c('Yes','No'))
## 1 2
## 0.5128970 0.5437244
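Those probabilities come from the logistic (inverse-logit) transform of a linear function of age and smoker, which is why the output always lands between 0 and 1. A base-R sketch of the transform:

```r
# inverse-logit: squashes any linear combination eta onto the interval (0, 1)
invlogit <- function(eta) 1 / (1 + exp(-eta))
invlogit(c(-2, 0, 2))
# [1] 0.1192029 0.5000000 0.8807971
```

makeFun() applied to a glm() fit returns values on this probability scale, which is what the plots below draw over the data.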
A function is a function, whatever the form!
xyplot( outcome=='Alive' ~ age | smoker, data=Whickham,
pch=20, alpha=.2 )
plotFun( f(age=age, smoker="No") ~ age, add=TRUE)
plotFun( f(age=age, smoker="Yes") ~ age, add=TRUE,
col='red', lwd=2)
actual <- lm(height ~ father+sex, data=Galton)
f <- makeFun( actual )
plotFun(f(x, sex='F')~x, x.lim=range(65,80))
# Confidence interval?
for (k in 1:50) {
  trial <- lm(height ~ father+sex, data=resample(Galton) )
  f <- makeFun( trial )
  print(plotFun(f(x, sex='F')~x, add=TRUE, col='blue', alpha=.5))
}
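Each pass through the loop refits the line to a resampled version of the data, so the blue band is a visual bootstrap confidence interval. The same idea reduced to numbers, using plain sample() in place of mosaic’s resample() and made-up stand-in data:

```r
set.seed(1)
# made-up stand-in data; in the text, the rows of Galton are resampled
father <- runif(200, 62, 76)
height <- 40 + 0.4 * father + rnorm(200, sd = 2.5)
# refit the slope to 500 resampled versions of the data
slopes <- replicate(500, {
  i <- sample(length(father), replace = TRUE)
  coef(lm(height[i] ~ father[i]))[[2]]
})
quantile(slopes, c(0.025, 0.975))   # a 95% bootstrap interval for the slope
```

The shuffle() loop below plays the complementary role: it draws the lines that arise when a null hypothesis is imposed on the data.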
# Null Hypothesis?
for (k in 1:10) {
  trial <- lm(height ~ father+shuffle(sex), data=Galton )
  f <- makeFun( trial )
  print(plotFun(f(x, sex='F')~x, add=TRUE, col='red', alpha=.5))
}